Autoselection Failsafe Controls

This document describes how the BigFix system uses ICMP during relay selection and how to configure the related settings for deployments of different sizes. Improperly configured ICMP settings can lead to excessive ICMP traffic in certain failure situations.

Technical Details

The BES Agent uses the ICMP network protocol during the “relay selection” process, in which the agent decides which BES Relay (or BES Server) to use as its parent. There are two types of relay selection:

Manual Relay Selection (default): During manual selection, BES Agents first try their “primary relay” and then their “secondary relay”. If both are unavailable, the agent tries the “failover relay” and then, as a last resort, the main BES Server. The primary, secondary, and failover relays must all be specified by a BES administrator. During manual relay selection, the BES Client sends an ICMP packet at the Maximum TTL first to its primary, then to its secondary, and lastly to its failover BES Relay prior to attempting registration (very similar to the “tracert” algorithm). If the ICMP packet does not reach a BES Relay, the BES Client will not attempt to register with it. If the ICMP ping is successful and the BES Client registers with the BES Relay, the hop count determined by the ICMP packet is reported in the Distance to BES Relay property.
Automatic Relay Selection: During automatic relay selection, the BES Client sends out rounds of ICMP traffic with a constant TTL in each round. Each round sends an ICMP packet to every BES Relay (BES Relays are listed in the ActionSite's Relay.dat file). Each round uses a higher TTL than the previous round, starting at 0 and skipping values at higher TTLs. In this way, BES Agents determine which BES Relay is closest in terms of network hops. For larger networks with many BES Relays, the ICMP traffic can cause network congestion during certain failure situations if the ICMP settings are not set properly. Care must be taken to choose ICMP settings appropriate to the deployment and network, and these settings should be reviewed periodically.
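
The automatic selection rounds described above can be sketched as a small simulation. This is a rough illustration, not the BES Client's actual algorithm: the function name, relay names, and hop distances are hypothetical. It assumes that every TTL from 0 up to one less than the maximum is tried (the real client skips some higher values), that a relay answers once the TTL exceeds its hop distance, and that the client stops at the first round that gets a reply.

```python
from typing import Dict, Optional, Tuple


def simulate_autoselection(
    relay_hops: Dict[str, int], max_ttl: int, first_ttl: int = 0
) -> Tuple[Optional[str], int]:
    """Return (closest relay or None, number of ICMP packets sent).

    relay_hops maps a relay name to its hop distance (hypothetical data).
    All behavior here is an assumption made for illustration only.
    """
    packets_sent = 0
    for ttl in range(first_ttl, max_ttl):
        # One round: one packet per known relay, all with the same TTL.
        packets_sent += len(relay_hops)
        # Assumed rule: a relay replies once the TTL exceeds its distance.
        reachable = sorted(name for name, hops in relay_hops.items() if hops < ttl)
        if reachable:
            # Ties are broken alphabetically, which is an arbitrary choice.
            return reachable[0], packets_sent
    return None, packets_sent


if __name__ == "__main__":
    relays = {"relay-a": 3, "relay-b": 1, "relay-c": 6}
    print(simulate_autoselection(relays, max_ttl=10))
```

The packet count grows with both the number of relays and the number of rounds, which is why the settings below limit the maximum TTL and the frequency of selection.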

Benefits of Autoselection

Automatic relay selection is a feature in BigFix that is designed to help the BES Agents find their optimal BES Relays. Autoselection provides the following benefits:

Autoselection allows the agents to find their optimal BES Relays, which will limit bandwidth consumed on WAN links.
Autoselection allows agents to find nearby relays and increase the speed of deploying packages such as applications or patches.
Autoselection will help prevent network outages caused by agents downloading packages from inappropriate relays and thus possibly saturating network WAN links.
Autoselection removes the need to manually point each BES Agent at each BES Relay (a very time-consuming and error-prone process).
Autoselection automatically provides load-balancing, failover to next best relay, and dynamically re-adjusts as the network changes or relay infrastructure changes.
Autoselection allows roaming computers to select the appropriate local BES Relay at each location.

Autoselection Risks

There are two main concerns with using ICMP in an enterprise network: the bandwidth consumed and the load placed on routers. Bandwidth is a major concern for deployments with large numbers of BES Clients, because many BES Clients running relay selection simultaneously may cause network congestion. The main concern for routers is consuming too much CPU processing ICMP traffic. Routers typically need more CPU to process a packet with a TTL of zero, so a router may become CPU constrained before a network link becomes bandwidth constrained.

In some large deployments, ICMP traffic sent from agents during autoselection can potentially cause network problems (including high router load) in certain rare failure scenarios if the agents are not configured properly.

To mitigate these risks, it is very important to constrain the number of ICMP packets sent by setting configuration settings in the BES Agent. These settings will control how often ICMP packets are sent, how many are sent, and how the agent handles failure situations (like its relay becoming temporarily unavailable).

Manual relay selection will generate negligible amounts of ICMP traffic and is considered to have no risk for generating too many ICMP packets.

A BES Server failure event can generate a higher than normal amount of ICMP traffic. If the BES Server is down, BES Client posts will begin to fill up in BES Relay FillDB folders and BES Clients will not be able to register with BES Relays. Once BES Relays have reached their FillDB buffer directory maximum capacity, they will begin to reject BES Client posts. Finally, once BES Clients reach their Resist Failure Interval, they will begin to run automatic relay selection. If the BES Server is down, automatic relay selection will fail until the BES Server is available again, unless a failover BES Server is available.

Recommended Settings

Definitions:

Number of computers

Small: <5000
Medium: 5,000-20,000
Large: 20,000-50,000
Very Large: 50,000+

Geographic Distribution

centralized: <50 relays
slightly distributed: 50-200
moderately distributed: 200-1,000
highly distributed: 1,000-3,000
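
The definitions above can be expressed as two small lookup functions, which is convenient when choosing a row and column in the tables that follow. This is a minimal sketch with hypothetical function names. Because the listed ranges share their boundary values (for example 20,000 appears in both Medium and Large), it assumes a boundary value belongs to the larger class, and it assumes more than 3,000 relays is still treated as highly distributed.

```python
def size_class(computers: int) -> str:
    """Map a number of computers to the size classes defined above."""
    if computers < 5_000:
        return "Small"
    if computers < 20_000:
        return "Medium"
    if computers < 50_000:
        return "Large"
    return "Very Large"


def distribution_class(relays: int) -> str:
    """Map a number of relays to the geographic distribution classes above."""
    if relays < 50:
        return "centralized"
    if relays < 200:
        return "slightly distributed"
    if relays < 1_000:
        return "moderately distributed"
    # Assumption: anything at or above 1,000 relays is highly distributed.
    return "highly distributed"


if __name__ == "__main__":
    print(size_class(12_000), "/", distribution_class(75))
```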

Setting Values:

Name: _BESClient_RelaySelect_MaximumTTLToPing

Default: 255 (Hops)

Description: This value represents the maximum TTL to use in automatic relay selection. The agent sends ICMP packets to relays with increasing TTLs until it reaches this value. For example, if this value is set to 10, then the agent will send ICMP packets with a TTL of 2, 3, 4, ..., 8, 9. In this case, the last ICMP packet sent will have a TTL of 9 (the last TTL sent is 1 less than the MaximumTTLToPing), which means that the packet will not pass the 9th router (and the “Distance to relay” property will never report more than 8).

Tradeoffs: A higher TTL value will allow the BES Client to find BES Relays that are farther away. At the default of 255, a BES Client would be able to reach any computer in practically any network. Higher MaxTTL values will generate more ICMP traffic during automatic relay selection because the BES Client will send “rounds” of ICMP packets until the maximum TTL is reached. A smaller TTL will generate fewer ICMP packets but BES Clients will only be able to find BES Relays that are closer in terms of network hops. If a BES Client is unable to find a relay at a distance less than the MaxTTL, it will attempt to select its failover relay.
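
The effect of this tradeoff on traffic volume can be estimated with simple arithmetic. The sketch below is a rough upper bound under stated assumptions, not a measurement: the function name is hypothetical, the first TTL is taken as 2 to match the example in the description, one packet is sent to every relay in every round, and no TTL values are skipped.

```python
def max_icmp_packets_per_selection(
    relay_count: int, max_ttl: int, first_ttl: int = 2
) -> int:
    """Upper-bound estimate of ICMP packets one client sends in one selection.

    Assumes one round per TTL from first_ttl to max_ttl - 1 and one packet
    per relay per round; the real client may send fewer.
    """
    rounds = max(0, max_ttl - first_ttl)
    return rounds * relay_count


if __name__ == "__main__":
    for ttl in (255, 10, 3):
        print(ttl, max_icmp_packets_per_selection(relay_count=500, max_ttl=ttl))
```

Under these assumptions, a client that knows 500 relays sends at most 126,500 packets per selection at the default of 255, but at most 4,000 with a maximum TTL of 10.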

Recommendation: The MaxTTL is one of the primary controls for limiting ICMP. For smaller or centralized networks, the ICMP traffic generated by autoselection can be handled by the network, but in larger, more distributed deployments the volume of ICMP packets grows dramatically in proportion to the number of relays, and much more care needs to be taken.

	Small	Medium	Large	Very Large
centralized	30	20	20	10
slightly distributed	20	10	10	8
moderately distributed	8	6	3	2*
highly distributed	6	3	2*	2*

* For the largest, most distributed customers, it is recommended that autoselection policies be reviewed with BigFix support. It may not be possible to use relay autoselection on some portions of the network.

Name: _BESClient_RelaySelect_IntervalSeconds

Default: 21600 (Seconds)

Description: The BES Relay selection algorithm will run periodically as specified by this setting. This allows the agent to find a more optimal relay than its current relay.

Tradeoffs: A smaller relay selection interval will allow BES Clients to find closer BES Relays more frequently. For example, if a new BES Relay is installed, the BES Clients will notice only when they do autoselection. Large values minimize the number of ICMP packets in aggregate, but small values allow for faster times to optimal relay selection.
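
The aggregate effect of this interval can be illustrated with a one-line calculation. This sketch uses a hypothetical function name and assumes that clients are evenly staggered and that each client runs autoselection exactly once per interval, which ignores selections triggered by failures.

```python
def selections_per_hour(client_count: int, interval_seconds: int) -> float:
    """Average number of autoselection runs per hour across all clients.

    Assumes one run per client per interval, spread evenly over time.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return client_count * 3600 / interval_seconds


if __name__ == "__main__":
    # 60,000 clients at the 6-hour default versus a 7-day interval.
    print(selections_per_hour(60_000, 21_600))
    print(selections_per_hour(60_000, 604_800))
```

Multiplying the result by the number of packets a single selection sends gives a rough sense of the steady-state ICMP load.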

Recommendation: Large deployments should increase this interval significantly to keep the average amount of ICMP traffic down. Smaller deployments can keep the value lower to make sure that BES Clients maintain relay optimality.

	Small	Medium	Large	Very Large
centralized	21600 (6 hours)	21600 (6 hours)	86400 (1 day)	86400 (1 day)
slightly distributed	21600 (6 hours)	43200 (12 hours)	86400 (1 day)	129600 (1.5 days)
moderately distributed	43200 (12 hours)	86400 (1 day)	259200 (3 days)	259200 (3 days)
highly distributed	86400 (1 day)	259200 (3 days)	604800 (7 days)	604800 (7 days)

Name: _BESClient_RelaySelect_ResistFailureIntervalSeconds

Default: 600 (Seconds)

Description: This value represents the amount of time a BES Client will wait after its relay appears down before performing BES Relay selection. A BES Client notices that a BES Relay is no longer accepting posts when it sends data (posts) to that relay. If the agent fails twice to post, it will consider the BES Relay to be unavailable. This ResistFailure setting is how long the agent waits before running autoselection once it considers the BES Relay to be unavailable. The interval begins at the time of the first failed post.
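
The timing rule in this description can be sketched as follows. The function name and the timestamps are hypothetical, and the sketch assumes that the failed posts are consecutive (no successful post in between) and that autoselection cannot start before the second failure has occurred.

```python
from typing import Optional, Sequence


def earliest_autoselection_time(
    failed_post_times: Sequence[float], resist_failure_seconds: float = 600
) -> Optional[float]:
    """Return the earliest time (in seconds) autoselection may start.

    Returns None while fewer than two posts have failed, because the relay
    is not yet considered unavailable.
    """
    if len(failed_post_times) < 2:
        return None
    ordered = sorted(failed_post_times)
    first_failure, second_failure = ordered[0], ordered[1]
    # The interval is measured from the first failed post.
    return max(first_failure + resist_failure_seconds, second_failure)


if __name__ == "__main__":
    print(earliest_autoselection_time([100.0, 400.0]))
```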

Tradeoffs: A lower failure interval will allow BES Clients to quickly find alternative BES Relays in the event that a BES Relay is not available. This will give BES Clients a higher connectivity rate when BES Relays are uninstalled or having communication failures. A higher value will allow more resilience if the BES Relays or BES Server is unavailable.

Recommendation: Larger deployments should have higher values to allow more time to recover in the event of a failure before agents run autoselection. Smaller deployments will benefit from a shorter resist failure value (as long as ICMP caused by the BES Server being down is not problematic).

	Small	Medium	Large	Very Large
centralized	600 (10 min)	1800 (30 min)	3600 (1 hour)	3600 (1 hour)
slightly distributed	1200 (20 min)	3600 (1 hour)	3600 (1 hour)	7200 (2 hours)
moderately distributed	3600 (1 hour)	7200 (2 hours)	21600 (6 hours)	21600 (6 hours)
highly distributed	7200 (2 hours)	14400 (4 hours)	21600 (6 hours)	21600 (6 hours)

Name: _BESClient_RelaySelect_MinRetryIntervalSeconds

Default: 60 (Seconds)

Description: If automatic relay selection fails (no BES Relays were found), the BES Client will try again after this many seconds. The BES Client will double this value on each successive retry that fails to locate a BES Relay. For relay selection to succeed, the BES Client must be able to find and register with a relay or the main BES Server. (BES Clients will fail to find any BES Relay if the BES Server is unavailable.)

Tradeoffs: A lower Minimum Retry Interval will allow the BES Client to run relay selection more often and find BES Relays faster once the failure is fixed. A higher value will generate fewer ICMP packets but make failure recovery slower. For example, if a laptop momentarily loses its network connection and can't find any BES Relays, a lower retry interval will allow it to quickly find a BES Relay once the connection is restored. On the other hand, if the BES Server is down causing BES Clients not to find any BES Relays, it would be best not to retry quickly due to the ICMP traffic.

Recommendation: Larger deployments should have higher values to allow more time between autoselection rounds. Smaller deployments will benefit from shorter retry values to recover from failures faster.

	Small	Medium	Large	Very Large
centralized	600 (10 min)	1800 (30 min)	3600 (1 hour)	3600 (1 hour)
slightly distributed	1200 (20 min)	3600 (1 hour)	3600 (1 hour)	7200 (2 hours)
moderately distributed	3600 (1 hour)	7200 (2 hours)	21600 (6 hours)	21600 (6 hours)
highly distributed	7200 (2 hours)	14400 (4 hours)	21600 (6 hours)	21600 (6 hours)

Name: _BESClient_RelaySelect_MaxRetryIntervalSeconds

Default: 7200 (Seconds)

Description: After failing to find a BES Relay, the BES Client will continue to try to find one. Each time it fails, the BES Client will double the time it waits before the next attempt, until this maximum is exceeded. From then on, the BES Client will retry at this maximum retry interval until it successfully selects a BES Relay.
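
The interaction between the minimum and maximum retry intervals is a capped doubling schedule, which can be sketched as a generator. The function name is hypothetical, and the sketch assumes the wait is clamped to the maximum as soon as doubling would exceed it.

```python
from itertools import islice
from typing import Iterator


def retry_intervals(
    min_retry_seconds: int = 60, max_retry_seconds: int = 7200
) -> Iterator[int]:
    """Yield successive waits between failed relay selection attempts."""
    if min_retry_seconds <= 0:
        raise ValueError("min_retry_seconds must be positive")
    interval = min_retry_seconds
    while True:
        # Never wait longer than the configured maximum.
        yield min(interval, max_retry_seconds)
        if interval < max_retry_seconds:
            interval *= 2


if __name__ == "__main__":
    # First nine waits with the default settings.
    print(list(islice(retry_intervals(), 9)))
```

With the defaults of 60 and 7200 seconds, the waits are 60, 120, 240, 480, 960, 1920, and 3840 seconds, and then 7200 seconds for every attempt after that.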

Tradeoffs: A lower Maximum Retry Interval will allow BES Clients to recover from down times faster, while a higher value will force a longer recovery time. A lower value will cause more ICMP traffic since the BES Client runs relay selection more often.

	Small	Medium	Large	Very Large
centralized	7200 (2 hours)	14400 (4 hours)	28800 (8 hours)	28800 (8 hours)
slightly distributed	7200 (2 hours)	14400 (4 hours)	57600 (16 hours)	86400 (1 day)
moderately distributed	14400 (4 hours)	86400 (1 day)	129600 (1.5 days)	129600 (1.5 days)
highly distributed	28800 (8 hours)	129600 (1.5 days)	172800 (2 days)	172800 (2 days)
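
For convenience, the recommendation tables in this document can be collected into a single lookup. The sketch below copies the table values as given, in seconds or hops; the constant and function names are hypothetical.

```python
SIZES = ("Small", "Medium", "Large", "Very Large")
DISTRIBUTIONS = (
    "centralized",
    "slightly distributed",
    "moderately distributed",
    "highly distributed",
)

# Rows follow DISTRIBUTIONS, columns follow SIZES.
RECOMMENDED = {
    "_BESClient_RelaySelect_MaximumTTLToPing": (
        (30, 20, 20, 10),
        (20, 10, 10, 8),
        (8, 6, 3, 2),
        (6, 3, 2, 2),
    ),
    "_BESClient_RelaySelect_IntervalSeconds": (
        (21600, 21600, 86400, 86400),
        (21600, 43200, 86400, 129600),
        (43200, 86400, 259200, 259200),
        (86400, 259200, 604800, 604800),
    ),
    "_BESClient_RelaySelect_ResistFailureIntervalSeconds": (
        (600, 1800, 3600, 3600),
        (1200, 3600, 3600, 7200),
        (3600, 7200, 21600, 21600),
        (7200, 14400, 21600, 21600),
    ),
    "_BESClient_RelaySelect_MinRetryIntervalSeconds": (
        (600, 1800, 3600, 3600),
        (1200, 3600, 3600, 7200),
        (3600, 7200, 21600, 21600),
        (7200, 14400, 21600, 21600),
    ),
    "_BESClient_RelaySelect_MaxRetryIntervalSeconds": (
        (7200, 14400, 28800, 28800),
        (7200, 14400, 57600, 86400),
        (14400, 86400, 129600, 129600),
        (28800, 129600, 172800, 172800),
    ),
}


def recommended_settings(distribution: str, size: str) -> dict:
    """Return the recommended value of each setting for one deployment class."""
    row = DISTRIBUTIONS.index(distribution)
    col = SIZES.index(size)
    return {name: table[row][col] for name, table in RECOMMENDED.items()}


if __name__ == "__main__":
    for name, value in recommended_settings("moderately distributed", "Large").items():
        print(name, value)
```

The MaximumTTLToPing values of 2 correspond to the cells marked with an asterisk, so autoselection policies for those deployments should still be reviewed with BigFix support.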